Briefing on the Modern ML Stack with R

Javier Luraschi, RStudio

Overview

About RStudio

RStudio’s Multiverse Team

Authors of R packages that support Apache Spark, TensorFlow and MLflow. Contributors to the tidyverse and Apache Arrow.

The Modern ML Stack with R

What is Spark?

“Apache Spark™ is a unified analytics engine for large-scale data processing.”

  • Unified: Spark supports many libraries, cluster technologies and storage systems.
  • Analytics: Analytics is the discovery and interpretation of data to produce and communicate information.
  • Engine: Spark is expected to be efficient and generic.
  • Large-Scale: One can interpret large-scale as cluster-scale, a set of connected computers working together.

Why Spark?

Information grows at exponential rates.

What’s next?

We see Spark supporting multiple projects: TensorFlow, MLflow, Modeling, etc.

Why R?

Modern R

The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.

Spark and R

In an ideal world, all R packages would work with Spark, like magic. This is already the case for dplyr: sparklyr provides a dplyr backend, so the same dplyr verbs run against Spark tables.
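
As a rough illustration of dplyr working with Spark through sparklyr, the sketch below writes one dplyr pipeline and applies it to a local data frame. The helper names `mean_mpg_by_cyl` and `run_on_spark`, the table name `mtcars_spark`, and the choice of the built-in `mtcars` data are assumptions made for this example, not part of the original slides. The Spark part assumes sparklyr and a local Spark installation are available, so it is defined but not called.

```r
library(dplyr)

# Hypothetical helper: mean miles-per-gallon per cylinder count.
# It uses only dplyr verbs, so it accepts a local data frame or a remote table.
mean_mpg_by_cyl <- function(tbl) {
  tbl %>%
    group_by(cyl) %>%
    summarise(mean_mpg = mean(mpg, na.rm = TRUE)) %>%
    arrange(cyl)
}

# Runs in memory on the built-in mtcars data frame.
local_result <- mean_mpg_by_cyl(mtcars)
print(local_result)

# Assumption: sparklyr and a local Spark installation are available.
# Defined but not called here, so the block runs without Spark.
run_on_spark <- function() {
  sc <- sparklyr::spark_connect(master = "local")
  on.exit(sparklyr::spark_disconnect(sc))
  # Copy the data frame into Spark and get a remote table reference.
  cars_tbl <- copy_to(sc, mtcars, "mtcars_spark", overwrite = TRUE)
  # The same pipeline is translated to Spark SQL; collect() returns the result to R.
  collect(mean_mpg_by_cyl(cars_tbl))
}
```

Calling `run_on_spark()` on a machine with Spark installed should return the same grouped summary, computed by Spark rather than in the R session. Keeping the pipeline in a function that takes any table is one way to show that the dplyr code itself does not change between the two backends.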

Timeline

2016-2019

Beyond 2020

Use Cases

Technical

Community

Thanks!